[configure] Backend Performance Requirements for etcd #148

Open
jing2uo wants to merge 2 commits into main from kb/2026-04-21/backend-performance-requirements-for-etc

Collaborator

@jing2uo jing2uo commented Apr 22, 2026

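Adds a new ACP KB article, filed under the configure area.

✅ Automated verification passed: 3 / 3 verification steps ran successfully on a real Kubernetes cluster using the commands from the article (2026-04-22T13:12:08Z).

Suggested reviewers for the configure area

Selected automatically from the active contributors listed for this area in kb/OWNERS.md + kb/KB_REVIEWERS.md. If you were @-mentioned by mistake, please ignore.

@changluyi @zhangzujian @oilbeater

Contributors without a GitHub handle (please ping manually if relevant to this area):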

Contributor

coderabbitai Bot commented Apr 22, 2026

Walkthrough

A new troubleshooting documentation page was added for etcd backend performance degradation, providing issue characterization, root-cause explanation, and resolution guidance with diagnostic procedures, commands, and monitoring thresholds.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| etcd Performance Troubleshooting Documentation: `docs/en/solutions/Backend_Performance_Requirements_for_etcd.md` | New documentation page describing etcd backend performance degradation: issue symptoms with log message examples, root-cause analysis linking backend bottlenecks to missed heartbeats and slow requests, and resolution procedures including fio disk I/O benchmarking (p99 fdatasync latency target), Prometheus metric monitoring with p99 thresholds, network health checks (RTT/packet loss), optional database defragmentation via etcdctl, and diagnostic commands using kubectl and curl. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

🐰 A doc for etcd's troubles is here,
Performance tips crystal and clear,
With benchmarks and thresholds to check,
No heartbeats shall go to heck! 💫

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly summarizes the main change: adding documentation about backend performance requirements for etcd, which matches the new file added and the PR's explicit objective. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@jing2uo jing2uo requested a review from oilbeater April 22, 2026 07:08
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/en/solutions/Backend_Performance_Requirements_for_etcd.md`:
- Around line 64-72: Add an explicit safety note to the etcd defragmentation
snippet instructing operators to run etcdctl defrag on one etcd member at a time
(sequentially, not concurrently) using the existing kubectl exec ...
etcd-<node-name> -- etcdctl defrag command; update the paragraph around the
command (referencing the "etcdctl defrag" and "kubectl exec -n kube-system
etcd-<node-name>" text) to state clearly to perform defrag on a single member,
wait for that member to rejoin/settle, then proceed to the next member to avoid
control-plane disruption.
- Around line 13-18: The fenced log block is missing a language tag which
triggers markdownlint MD040; update the code fence that contains the lines
beginning with "etcdserver: failed to send out heartbeat..." and the subsequent
etcdserver/wal lines by adding a language identifier (e.g., "text") after the
opening ``` so the block reads like ```text to satisfy the linter and preserve
formatting.
- Around line 41-45: The fio benchmark currently uses the --fsync=1 flag which
measures fsync (data+metadata) but the documented SLA and WAL behavior require
measuring fdatasync; update the fio invocation in the docs (the command line
containing fio --name=etcd-io-test ...) to replace --fsync=1 with --fdatasync=1
so the 99th-percentile fdatasync latency threshold (<10 ms) is validated
correctly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e512c52c-72ca-4013-a6cf-cbe02c53c390

📥 Commits

Reviewing files that changed from the base of the PR and between 7a0b251 and e0e0467.

📒 Files selected for processing (1)
  • docs/en/solutions/Backend_Performance_Requirements_for_etcd.md

Comment on lines +13 to +18
```
etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for xxx ms)
etcdserver: server is likely overloaded
etcdserver: read-only range request "key:\"xxxx\"" count_only:true with result "xxxx" took too long (xxx s) to execute
wal: sync duration of xxxx s, expected less than 1s
```

⚠️ Potential issue | 🟡 Minor

Add a language tag to the log code fence.

The fenced block at Line 13 is missing a language identifier, which triggers markdownlint MD040.
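
The same class of problem can be caught across a whole file before the linter runs. The following is a minimal sketch, not markdownlint's own logic: `find_untagged_fences` is a hypothetical helper, and it assumes fences are written as exactly three backticks at the start of a line and that any such line closes a block that is already open.

```shell
# find_untagged_fences <file>
# Prints "<file>:<line>: fence has no language" for every opening fence
# that carries no info string (the condition MD040 reports).
find_untagged_fences() {
  awk '
    /^```/ {
      if (!in_fence) {
        # Opening fence: flag it when nothing but whitespace follows the backticks.
        if ($0 ~ /^```[[:space:]]*$/) print FILENAME ":" NR ": fence has no language"
        in_fence = 1
      } else {
        # Simplification: any fence line seen inside a block is treated as its close.
        in_fence = 0
      }
    }
  ' "$1"
}
```

Calling `find_untagged_fences docs/en/solutions/Backend_Performance_Requirements_for_etcd.md` would list the opening fences that still need a tag such as `text` or `bash`.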

✅ Suggested patch
-```
+```text
 etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for xxx ms)
 etcdserver: server is likely overloaded
 etcdserver: read-only range request "key:\"xxxx\"" count_only:true with result "xxxx" took too long (xxx s) to execute
 wal: sync duration of xxxx s, expected less than 1s

🧰 Tools
🪛 markdownlint-cli2 (0.22.0)

[warning] 13-13: Fenced code blocks should have a language specified (MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/en/solutions/Backend_Performance_Requirements_for_etcd.md` around lines
13 - 18, The fenced log block is missing a language tag which triggers
markdownlint MD040; update the code fence that contains the lines beginning with
"etcdserver: failed to send out heartbeat..." and the subsequent etcdserver/wal
lines by adding a language identifier (e.g., "text") after the opening ``` so the
block reads like ```text to satisfy the linter and preserve formatting.

Comment on lines +41 to +45
fio --name=etcd-io-test --ioengine=sync --bs=4k --numjobs=1 --size=512M \
--rw=write --iodepth=1 --fsync=1 --runtime=30 --time_based
```

The 99th percentile fdatasync latency must be under **10 ms**.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In fio, what is the behavioral difference between --fsync=1 and --fdatasync=1, and which one best matches etcd WAL fdatasync latency validation guidance?

💡 Result:

In fio, --fsync=1 issues fsync(2) after every block written, syncing both data and metadata to disk. --fdatasync=1 issues fdatasync(2) after every block, syncing only data (not metadata unless required for data retrieval), falling back to fsync on unsupported platforms. --fdatasync=1 best matches etcd WAL fdatasync latency validation guidance. Etcd primarily uses fdatasync for WAL syncs (fsync only on create/close/repair/cut), and official validation uses fio with --fdatasync=1 (e.g., fio --rw=write --ioengine=sync --fdatasync=1 --bs=2300 --size=XXm), checking 99th percentile fdatasync latency <10ms.


🏁 Script executed:

fd "Backend_Performance_Requirements_for_etcd.md" --type f

Repository: alauda/knowledge

Length of output: 122


🏁 Script executed:

sed -n '41,45p' docs/en/solutions/Backend_Performance_Requirements_for_etcd.md

Repository: alauda/knowledge

Length of output: 266


Change --fsync=1 to --fdatasync=1 to match the documented threshold.

The fio command uses --fsync=1 (which syncs data and metadata), but the threshold requires fdatasync latency <10ms (which syncs data only). This mismatch means the benchmark measures the wrong primitive. etcd WAL operations primarily use fdatasync, not fsync. Update the command to --fdatasync=1 to correctly validate etcd performance.

Suggested patch
fio --name=etcd-io-test --ioengine=sync --bs=4k --numjobs=1 --size=512M \
-    --rw=write --iodepth=1 --fsync=1 --runtime=30 --time_based
+    --rw=write --iodepth=1 --fdatasync=1 --runtime=30 --time_based
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
fio --name=etcd-io-test --ioengine=sync --bs=4k --numjobs=1 --size=512M \
--rw=write --iodepth=1 --fsync=1 --runtime=30 --time_based
```
The 99th percentile fdatasync latency must be under **10 ms**.
fio --name=etcd-io-test --ioengine=sync --bs=4k --numjobs=1 --size=512M \
--rw=write --iodepth=1 --fdatasync=1 --runtime=30 --time_based
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/en/solutions/Backend_Performance_Requirements_for_etcd.md` around lines
41 - 45, The fio benchmark currently uses the --fsync=1 flag which measures
fsync (data+metadata) but the documented SLA and WAL behavior require measuring
fdatasync; update the fio invocation in the docs (the command line containing
fio --name=etcd-io-test ...) to replace --fsync=1 with --fdatasync=1 so the
99th-percentile fdatasync latency threshold (<10 ms) is validated correctly.
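
Once the job runs with `--fdatasync=1`, the 10 ms threshold can be checked mechanically instead of being read off the report. The following is a sketch under assumptions: `check_p99` and `run_benchmark` are hypothetical helper names, `jq` is assumed to be installed, and the JSON path `.jobs[0].sync.lat_ns.percentile["99.000000"]` matches what recent fio 3.x releases emit but should be verified against the installed version.

```shell
# Threshold from the article: p99 fdatasync latency must be under 10 ms.
THRESHOLD_NS=10000000

# check_p99 <p99_latency_ns>
# Prints PASS when the latency is below the threshold, FAIL otherwise.
check_p99() {
  if [ "$1" -lt "$THRESHOLD_NS" ]; then
    echo "PASS"
  else
    echo "FAIL"
    return 1
  fi
}

# run_benchmark <directory>
# Runs the fio job from the review with --fdatasync=1 and prints the
# p99 sync latency in nanoseconds (JSON path is an assumption, see above).
run_benchmark() {
  fio --name=etcd-io-test --directory="$1" --ioengine=sync --bs=4k \
      --numjobs=1 --size=512M --rw=write --iodepth=1 --fdatasync=1 \
      --runtime=30 --time_based --output-format=json |
    jq '.jobs[0].sync.lat_ns.percentile["99.000000"]'
}
```

A typical call would be `check_p99 "$(run_benchmark <etcd-data-dir>)"`, run against the disk that backs the etcd data directory. fio leaves its 512M test file in that directory, so it needs to be removed afterwards.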

Comment on lines +64 to +72
If the database size approaches the quota, perform manual defragmentation:

```bash
kubectl exec -n kube-system etcd-<node-name> -- etcdctl defrag \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
```

⚠️ Potential issue | 🟠 Major

Add a defrag safety note (one member at a time).

This runbook should explicitly instruct sequential defragmentation (not all members concurrently) to reduce control-plane disruption risk.

✅ Suggested patch
 ### Database Defragmentation
 
 If the database size approaches the quota, perform manual defragmentation:
+Run defragmentation on **one etcd member at a time** and wait for the member to become healthy before moving to the next member.
 
 ```bash
 kubectl exec -n kube-system etcd-<node-name> -- etcdctl defrag \
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
If the database size approaches the quota, perform manual defragmentation:
```bash
kubectl exec -n kube-system etcd-<node-name> -- etcdctl defrag \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
```
If the database size approaches the quota, perform manual defragmentation:
Run defragmentation on **one etcd member at a time** and wait for the member to become healthy before moving to the next member.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/en/solutions/Backend_Performance_Requirements_for_etcd.md` around lines
64 - 72, Add an explicit safety note to the etcd defragmentation snippet
instructing operators to run etcdctl defrag on one etcd member at a time
(sequentially, not concurrently) using the existing kubectl exec ...
etcd-<node-name> -- etcdctl defrag command; update the paragraph around the
command (referencing the "etcdctl defrag" and "kubectl exec -n kube-system
etcd-<node-name>" text) to state clearly to perform defrag on a single member,
wait for that member to rejoin/settle, then proceed to the next member to avoid
control-plane disruption.
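
The sequential procedure can be scripted so that members are never defragmented concurrently. This is a hedged sketch rather than a definitive runbook: `etcdctl_on` and `defrag_sequentially` are hypothetical helpers, the TLS flags are copied from the article's command, `etcdctl endpoint health` is used here as the "member has settled" signal, and the retry budget (12 attempts, 5 s apart) is an arbitrary assumption.

```shell
# etcdctl_on <node-name> <etcdctl args...>
# Runs etcdctl inside the etcd static pod of the given node.
etcdctl_on() {
  local node="$1"
  shift
  kubectl exec -n kube-system "etcd-${node}" -- etcdctl "$@" \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key
}

# defrag_sequentially <node-name>...
# Defragments one member, waits until it reports healthy, then moves on.
# Stops at the first member that fails or does not recover in time.
defrag_sequentially() {
  local node attempt
  for node in "$@"; do
    echo "defragmenting etcd-${node}"
    etcdctl_on "$node" defrag || return 1
    attempt=0
    until etcdctl_on "$node" endpoint health; do
      attempt=$((attempt + 1))
      if [ "$attempt" -ge 12 ]; then
        echo "etcd-${node} did not become healthy; stopping" >&2
        return 1
      fi
      sleep 5
    done
  done
}
```

Invoking `defrag_sequentially <node-a> <node-b> <node-c>` processes the members strictly in the order given. Aborting on the first failure leaves the remaining members untouched.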
